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Current methods for population mean estimation from data col- 
lected by Respondent Driven Sampling (RDS) are based on the Horvitz- 
Thompson estimator together with a set of assumptions on the sam- 
pling model under which the inclusion probabilities can be deter- 
mined from the information contained in the data. In this paper, we 
argue that such set of assumptions are too simplistic to be realistic 
and that under realistic sampling models, the situation is far more 
complicated. Specifically, we study a realistic RDS sampling model 
that is motivated by a real world RDS dataset. We show that, for this 
model, the inclusion probabilities, which are necessary for the appli- 
cation of the Horvitz-Thompson estimator, can not be determined by 
the information in the sample alone. An implication is that, unless 
additional information about the underlying population network is 
obtained, it is hopeless to conceive of a general theory of population 
mean estimation from current RDS data. 

1. Introduction. Obtaining useful samples of hidden populations with 
a network structure is a prerequisite for many types of research, especially 
for studies of epidemiological problems such as addiction and HIV/ AIDS. 
Respondent Driven Sampling (RDS) is a recently proposed sampling tech- 
nique that seeks to sample from such hidden populations in a way that allows 
for valid estimation of population quantities. 

RDS begins with a non-random selection of a small set of individuals 
in the target population. These individuals are referred to as seeds. Data 
relevant to the study is first collected from these seeds. In addition, the 
seeds are asked to report their degree. We follow Volz and Heckathorn (2008) 
and define the degree of an individual as the number of people that the 
individual could, in principle, recruit. The seeds are then asked, and provided 
financial incentive, to recruit into the study their social contacts (provided 
the contacts are also in the target population). The sampling continues in 
this way with newly recruited sample members recruiting the next wave of 
sample members until the desired sample size is reached. Whenever a subject 
comes into the study, data relevant to the study are collected from him/her 
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and in addition, his/her degree in the target population is recorded. Further, 
a record of who recruited whom is maintained. 

While RDS has proved to be extremely effective at penetrating hidden 
populations, the statistical dependencies that it induces in the sample make 
the problem of estimating population quantities an intricate task. Let us 
now define basic notation to describe the current employed methods of esti- 
mation from RDS data. Let G denote the population equipped that we wish 
to sample from. We assume that G has a network/graph structure with 
nodes/vertices representing and edges/connections representing extant so- 
cial relationship. In particular, the neighbors of an individual denote the set 
of subjects that the individual can potentially recruit. Let y denote a pop- 
ulation quantity whose mean, /i, we are interested in estimating. A sample 
of size n is drawn via RDS from G. Some of the individuals in the sample 
are selected as seeds while the others are selected through the process of 
recruitment as described above. From each of the sample individuals, data 
on the population quantity y is collected. Also, the population degrees of 
the sample individuals are obtained by enquiry and the information on re- 
cruitment (i.e., who recruited whom) is recorded. The goal is to estimate 
/i using all available information i.e., the y- values of the sample individuals 
yi , . . . , y n , the population degrees of the sample individuals d\ , . . . , d n and 
the information on who recruited whom. 

The current methods of estimation from RDS data, developed mainly 
in Heckathorn (1997), Heckathorn (2002), Salganik and Heckathorn (2004), 
Volz and Heckathorn (2008) and Gile (2010) can all be viewed as being 
based on the Horvitz-Thompson estimator (Horvitz and Thompson, 1952), 
which is a standard estimator in the theory of survey sampling. The Horvitz- 
Thompson estimator for the population mean /x is given by 



where 7Tj is known as the inclusion probability of the i sample individual 
and is defined as the probability that the i th sample individual is included 
in the sample. This estimator is applicable to both with replacement and 
without replacement sampling models. 

It should be noted that the Horvitz-Thompson estimator is applicable only 
to probability sampling schemes where the inclusion probabilities tt\ , . . . , 7r n 
of the sample individuals can be determined. In the papers cited above, the 
authors place a variety of assumptions on the RDS data generation process 
and argue that, under those assumptions, the sample obtained according to 
RDS can be taken to be a probability sample and that, as required by the 
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Horvitz-Thompson estimator, that the inclusion probabilities of the sample 
individuals can be determined as a function of their self-reported degrees. 
We shall describe here the methods of Volz and Heckathorn (2008) and 
Gile (2010). The estimator in Volz and Heckathorn (2008) is termed RDS 
II and, as illustrated in Gile and Handcock (2010), is an improvement over 
the classical RDS estimation procedure developed in Heckathorn (1997), 
Heckathorn (2002) and Salganik and Heckathorn (2004). As a result, these 
earlier estimators no longer need to be considered. 

Volz and Heckathorn (2008) assume that respondents recruit uniformly 
at random from their network neighbors (this ensures that RDS is a prob- 
ability sampling model). In addition, they also make the following pair of 
assumptions: 

1. Samples are drawn with-replacement 

2. Each respondent recruits exactly one other respondent into the study. 

Volz and Heckathorn (2008) argue that under these assumptions, the process 
of obtaining an RDS sample is equivalent to performing a random walk on 
the population network. They then assert, based on convergence properties 
of Markov chains, that the inclusion probability, 7Tj, of the ith. individual 
in the sample should be directly proportional to his/her degree, d%. They 
therefore construct their estimator for the population mean \i by replacing 
7Tj in the formula (1) by di. This leads to the intuitively appealing and 
mathematically uncomplicated estimator RDS II where the population mean 
is estimated by the weighted average of the sample observations, the weights 
being inversely proportional to the self-reported population degrees of the 
sample individuals. 

The problem, however, with RDS II is that it is founded on the two 
assumptions 1 and 2 which are both routinely violated in practice. Assump- 
tion 1 which implies that any individual may be recruited into the sample 
more than once, is never allowed. Assumption 2 can be relaxed (see Salganik 
and Heckathorn (2004, pp. 210)) to the case where all respondents recruit 
an equal number of respondents. In practice, however, different respondents 
recruit differently (we shall refer to this as differential recruitment in the 
sequel) and this aspect is not taken into account by Volz and Heckathorn. 
More details on the violation of these assumptions in real-world sampling 
can be found in Heimer (2005). 

Regrettably, when assumptions 1 and 2 are violated, the underlying the- 
oretical foundation for RDS II breaks down making its role as a population 
mean estimator unsound and suspect. Indeed, if the samples are drawn with- 
out replacement, then the recruitment process is no longer Markov because, 
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in standard and near-universal practice, it is not allowed to re-recruit an 
individual who is already in the sample (however distant past he/she may 
have been recruited in) and thus the process is forced to have memory. 

The estimation procedure of Gile (2010) is much more elaborate com- 
pared to RDS II. Gile (2010) assumed that the underlying data collection 
process in RDS can be modeled as a successive sampling process. Under the 
assumption of successive sampling, Gile described an iterative algorithm for 
approximating the inclusion probabilities 7Tj based on the information con- 
tained in the RDS sample alone. The algorithm, which can be considered as 
a variant of the Expectation-Maximization (EM) algorithm (see Gile (2010, 
pp. 12)), further assumes that the population size is known and also makes 
an assumption on the graph structure of the true population (a variant of 
the configuration model for networks). 

Just like the assumptions of Volz and Heckathorn, the assumption of 
successive sampling is also not realistic and would not be a reasonable ap- 
proximation to most real-world RDS data collection processes. For example, 
although it allows for without-replacement sampling, it still does not allow 
for differential recruitment. Unfortunately, the estimator of Gile (2010) cru- 
cially depends on the assumption of successive sampling and it is not at 
all clear as to how it can be extended to real-world situations where the 
assumptions of successive sampling do not hold. 

To illustrate the divergence of the assumptions of Volz and Heckathorn 
(2008) and those of Gile (2010) with real world RDS sampling, we will re- 
view a recent study of the HIV epidemic in St. Petersburg, Russia. A primary 
objective of The Sexual Acquisition and Transmission of HIV Cooperative 
Agreement Project (SATHCAP) study was to estimate certain population 
means (including the prevalence of HIV and hepatitis C) in the population 
of Injection Drug Users (IDUs) in St. Petersburg. In the first cycle of recruit- 
ment conducted from November 2005 through December 2006, a sample of 
373 IDUs was obtained using an RDS design and data on variables relevant 
to the study (for example, HIV status, Hepatitis C status etc) were collected 
from the subjects. In addition, their self-reported population degrees were 
recorded. The histogram of these self-reported degrees in the SATHCAP 
dataset is given in Figure 1. 

The SATHCAP sample was collected without replacement and different 
individuals recruited differently even though each individual received the 
same fixed number of coupons. We have displayed this differential recruit- 
ment pattern in Table 1. It is clear from Table 1 that the assumptions of 
Volz and Heckathorn (2008) and those of Gile (2010) are violated for the 
SATHCAP dataset. 
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Fig 1: Histogram of the self-reported degrees in the SATHCAP dataset. 



The goal of this paper is to argue that for realistic RDS sampling models, 
it is impossible to determine the inclusion probabilities 7Q , . . . , 7r n from the 
information available in an RDS sample. The determination of inclusion 
probabilities is possible only under highly simplistic modeling assumptions 
such as those made by Volz and Heckathorn (2008) or Gile (2010). 

In section 2 of this paper, we shall describe a realistic RDS sampling 
model, to be denoted by M, that will account for both differential recruit- 
ment and sampling without replacement, thus creating a more faithful ap- 
proximation to the data collection process in the SATHCAP study. The 
model Ai is conceptually uncomplicated, although it is more elaborate than 
currently studied RDS models since both without replacement sampling and 
differential recruitment are assumed. Two main features of M are: (a) re- 
spondents recruit only from those of their neighbors who are not already 
in the sample, and (b) instead of recruiting exactly one other subject, each 
respondent recruits no other subject with a certain probability, exactly one 
other subject with a certain probability and so on. We assume that these 
probabilities (which will be parameters in the model) are the same for all 
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Percentage of Subjects 


Number of Recruits 


58.98 





15.55 


1 


10.19 


2 


9.65 


3 


3.49 


4 


1.34 


5 


0.8 


6 



Table 1 

Observed Recruitment Pattern in SATHCAP. 58.98% of the sample individuals did not 
recruit others into the study (in spite of receiving coupons), 15.55% of the individuals 
recruited exactly one other individual into the study etc. 

the respondents. We refer the reader to section 2 for a full description of the 
model. 

In order to use the Horvitz-Thompson estimator for estimating fj, from 
data generated according to the model Ai, it is necessary to determine the 
inclusion probabilities tti , . . . , ir n under Ai . The behavior of the inclusion 
probabilities under Ai is quite detailed involving both individual degrees 
and the overall network structure. In section 3, we demonstrate that, in 
contrast to the RDS models studied so far in the literature, the model Ai 
is such that the inclusion probability for an individual depends not only 
on his/her degree but also on the network structure of the population. We 
present two simulated population networks having identical degrees but very 
different inclusion probabilities under our sampling model. 

Our example demonstrates that the inclusion probabilities tti, . . . , 7T n for 
a realistic RDS model depend not only on the population degrees of sam- 
ple individuals, d\, . . . ,d n but also on other characteristics of the underlying 
population graph. Since the self-reported degrees di, . . . , d n present the only 
information about the population network that is contained in an RDS sam- 
ple, we deduce that the inclusion probabilities 7Ti,.. . ,7r n for realistic RDS 
models can not be calculated from the RDS sample. This precludes the appli- 
cation of the Horvitz-Thompson estimator as a population mean estimator 
from RDS data. As a result, a general estimation theory is not possible from 
RDS data. 

The situation can only be remedied if one obtains additional information 
about the underlying population through an RDS sample. We have some 
incomplete ideas in this regard that we describe in Section 4. We organize 
concepts as follows: in the next section, we describe our RDS sampling model 
Ai. In Section 3, we show that inclusion probabilities under Ai depend not 
only on the degrees but also on the structure of the underlying population 
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network. We finish the paper summarizing our conclusion and describing 
some ideas for future work. 

2. A Realistic Respondent Driven Sampling Model. In this sec- 
tion, we describe a new model for RDS, to be denoted by M, that allows 
for both without replacement sampling and differential recruitment. As ex- 
plained in the previous section, it is inspired from the underlying data col- 
lection mechanism in the SATHCAP study. 

Our model M is conceptually simple. We consider nonnegative real num- 
bers Po,Pi,P2, ■ ■ ■ that sum to 1 and instead of supposing that every respon- 
dent recruits exactly one other subject, we assume that each respondent 
decides to recruit no other subject with probability po, exactly one subject 
with probability p±, exactly two subjects with probability p2 and so on. This 
would clearly permit differential recruitment. Suppose that a respondent de- 
cides to recruit s± subjects and suppose that the number of his/her neighbors 
that are not already in the sample is S2- We then assume that the respon- 
dent actually recruits min(si, S2) subjects uniformly at random from among 
his/her neighbors who are not already in the sample. As a consequence of 
this assumption, respondents recruit only from those of their neighbors that 
are not already in the sample. Therefore, M. rules out subjects reappearing 
in the sample and hence, is a without-replacement sampling model. 

We assume that seeds are chosen uniformly at random from all the pop- 
ulation members who are not already in the sample. The other assumption 
that is commonly made concerning seed selection in standard RDS analy- 
sis is that seeds are chosen with probabilities proportional to degrees. It is 
debatable as to which assumption (uniformly at random or random with 
probabilities proportional to degrees) is more reasonable. We also assume 
that the sampling starts with one seed and that new seeds are selected only 
when necessary i.e., only when recruiting stops. 

The following is the complete description of the sampling model in the 
form of a randomized algorithm. The set active represents active/potential 
recruiters i.e., the sample individuals who currently possess coupons and who 
can therefore potentially recruit other individuals into the sample. When 
there are multiple active recruiters, we assume that one of them (uniformly 
at random) recruits first. 

1. Initially there are no active recruiters. So we initialize active to be 
the empty set. The following steps are then repeated till the desired 
sample size n is reached. 

2. If active is empty, we need to select a seed. The seed is chosen from 
the set of all population nodes that are not already in the sample and 
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the seed is included in both the sample and the set active. 
3. If active is non-empty, i.e., if there are active recruiters, the following 
steps take place 

(a) One node is chosen uniformly at random from active, this respon- 
dent recruits first. Let available be the set of all neighbors of this 
node that are NOT already in the sample and let s\ be the size 
of available. 

(b) A number S2 from 0, 1, . . . is chosen with probabilities Po,Pi, 

Then min(si, S2) nodes are chosen from available. These nodes are 
included in the sample and in the set active. Also the recruiter is 
deleted from active. 

This completes our description of A4. For every sampled individual, we col- 
lect data on the study variables and his/her degree (self-reported). The 
probabilities po,Pi, ■ ■ ■ are parameters in M.. 

In the next section, we demonstrate that for A4, the inclusion probability 
of an individual depends crucially on the population graph structure and 
that the inclusion probabilities can not be determined by the degrees alone. 

3. Inclusion probabilities under the model J\A. In this section, 
we argue that, for the sampling model Ai, the inclusion probabilities of 
individuals depend not only on their degree but also on the network structure 
of the true population. We proceed by constructing two population networks 
Gi and G2 having the same nodes (say, N, of them) and having identical 
degrees but vastly different inclusion probabilities under the model Ai. We 
would like to stress that the two networks G\ and G2 have identical degrees; 
not just identical degree distributions. In other words, both G\ and G2 
have the same set of nodes, which can be numbered 1, . . . , N, and, for each 
i = 1, . . . , N, the degree of the ith vertex is exactly the same in both G\ 
and G2. 

We base our construction of the networks G\ and G2 on the degrees 
reported by the 373 individuals in the SATHCAP sample (see Figure 1). We 
take these 373 self-reported degrees and resample from this set uniformly 
with replacement to create a set of N integers, where N > 373 will be 
specified shortly. Let us denote this set of N integers by D\, . . . ,Dn and 
we assume that D\ > D2 > ■ ■ ■ > Dpj. A standard result (see, for example, 
Sierksma and Hoogeveen, 1991) asserts that there exists a graph on N nodes 
with degrees Di,..., Djy (in which case, D\, . . . , Dn is known as a graphical 
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sequence) provided the following criterion is satisfied: 

k N 

(2) J^max(A- A; + 1,0) < ^ A for every k = 1, . . . ,N — 1. 

i=l i=k+l 

Now because the integers A, ■ • • ) Av have been constructed from the SATH- 
CAP self- reported degrees (which have the histogram given in Figure 2), the 
histogram of A, . . . , Av will also approximately (at least when N is large) 
be as in Figure 2 (with just a change of scale on the y-axis). As a result, it 
is not hard to see that, when N is large, the probability that there exists a 
graph with degrees Aj • • • , Av is quite high (mainly because the right hand 
side of (2) increases with ./V while the left hand side remains unchanged). 
We choose N = 5000 and, at least in simulations, this invariably resulted in 
a graphical sequence A, • • • > Av- 

For such a fixed graphical sequence A , • • • , A\r, we construct networks 
G\ and G2 having N nodes with degrees A, • • • , Av- Let us first describe 
the construction of G%. G\ is a deterministic graph with degrees A, • • • , Av- 
We take the algorithm for the construction of G± from Raman (1991). The 
algorithm starts with the empty graph (the graph on N nodes with no 
edges) and adds edges until the ith node has degree A for alH = 1, . . . , N. 
At any stage of the algorithm, the residual degree of the zth node is defined 
as the difference of A and its current degree. The algorithm proceeds by 
repeatedly joining the node with the largest residual degree (say, k) to the 
k nodes with the next largest residual degrees. It can be shown, see Raman 
(1991), that, whenever A, • • • , Av is graphical, this algorithm results in a 
graph with degrees A, • • • , Av- From the construction, it is easy to see that 
G\ is a deterministic graph in which the high degree nodes have a tendency 
to connect to other high degree nodes. In other words, there is a homophily 
(the tendency of individuals to associate with those similar to themselves) 
by degree in the network G\. 

Let us now explain the construction behind G2 for which we take the 
randomized algorithm of Bayati, Kim and Saberi (2007, pp. 329, Procedure 
A). This algorithm also starts with the empty graph and sequentially adds 
edges between pairs of non- adjacent nodes until the ith node has degree A 
for alH = 1, . . . , N. Unlike Raman's algorithm however, this is a probabilistic 
algorithm and at every step, the probability that an edge is added between 
two non-adjacent nodes i and j is proportional to 

( A A \ 

m V " 2(A + --- + Av)J ' 
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where Ri is the residual of node i. Bayati, Kim and Saberi (2007) showed 
that, provided that D% (which is the maximum of D\, . . . , -Djv) is sufficiently 
small compared to D\ + • • • + Djy, this algorithm produces a random graph 
with degrees D\, . . . , with high probability and, moreover, the distribu- 
tion of this random graph will be approximately uniform on the set of all 
graphs with degrees D\, . . . , Djy. This condition on D%, . . . , Djy is satisfied 
in our case and we can thus view G2 as one realization of a random graph 
with degrees D±, . . . , Djy whose distribution is approximately uniform over 
all graphs with degrees D\ , . . . , D/y . 

We have thus created two networks G\ and G2 on N nodes having identi- 
cal degrees D\, . . . , -Djv- However, these two networks are very different from 
each other. G\ is a deterministic graph with a strong homophily by degree. 
On the other hand, G2 is one realization of a random graph that is uniformly 
distributed over all graphs with degrees D%, . . . , Dm- We now show that the 
inclusion probabilities of the nodes 1, . . . , N for a sample of size n = 373 
(we use n = 373 because that was the sample size in the SATHCAP study) 
drawn according to the realistic RDS sampling model A4 are vastly different 
for the two networks G\ and G2. 

From each of G\ and G2, we obtained 10000 samples each of size n = 373 
using the model M. with parameters po = 0.5898, p\ = 0.1555, P2 = 0.1019, 
p 3 = 0.0965, p A = 0.0349, p 5 = 0.0134, p 6 = 0.008 and p 7 ,p s ,--- are all 
equal to 0. These specific values for the probabilities were chosen from the 
observed recruitment proportions in the SATHCAP dataset (Table 1). Us- 
ing these 10000 samples, the inclusion probability of any individual in the 
population under the model Ai (for the chosen parameters) can be well- 
approximated by the proportion of samples containing the individual. These 
inclusion probabilities are plotted against degrees for each of the two net- 
works G\ and G2 in Figure 2. It is clear from Figure 2 that the inclusion 
probabilities for G\ and G2 are quite and meaninfully distinct (the scale of 
the two plots is exactly the same) even though G\ and G2 have identical 
degrees. The inclusion probabilities for individuals in G\ are roughly propor- 
tional to the square root of their degrees. On the other hand, the inclusion 
probabilities in G\ are nearly directly proportional to degrees. Therefore, 
the inclusion probabilities for the model A4 depend not only the degrees 
but also on the network structure of the true population. 

4. Conclusion and Future Work. In contrast with earlier works on 
respondent driven sampling, we presented a realistic RDS model, motivated 
by a real world RDS dataset. We demonstrated that for this model, the 
inclusion probabilities of sample individuals does not depend on their de- 
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Fig 2: For each of the two networks G\ and G2, the inclusion probabilities of all 
population individuals under the model M. are plotted against degrees. The 
scale is exactly the same in both the plots. 



grees alone. Indeed, we constructed two population networks G\ and G2 
having identical set of nodes and degrees but with vastly different inclu- 
sion probabilities. This implies that, for this RDS model, it is impossible 
to construct a general estimation theory for population means based on the 
Horvitz-Thompson estimator. 

For estimation under such a realistic RDS model to be feasible, we either 
need more information about the underlying population network or we need 
to be content in only estimating certain but not all population means. In 
this regard, we have the following ideas which we hope to explore in future 
work: 

1. Under assumptions in Volz and Heckathorn (2008), the inclusion prob- 
abilities of sampled individuals are proportional to their degrees and 
hence they depend solely on the degrees up to the constant of pro- 
portionality. Heckathorn (1997) realized that the researcher can learn 
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the degrees of individuals by just asking them and since then, sampled 
individuals are always asked to report their degrees in RDS studies. 
In the case of the model M however, we have seen that the inclusion 
probabilities depend not only on the degrees but also on the network 
structure of the true population. It is of interest to understand the 
precise network characteristics on which the inclusion probabilities de- 
pend and whether such characteristics can be learned by asking the 
sampled subjects additional questions. Recent work by Cepeda et al 
(2011) suggests that additional information on network structure such 
as total network size and stability would enhance HIV prevalence esti- 
mates and assist prevention measures in a population of injection drug 
users. These types of questions could yield enough information about 
the underlying network for the purpose of approximating inclusion 
probabilities. 

2. We have studied two network models in this paper; the ones that were 
involved in the construction of G\ and G^. It will be of interest to 
study more such network models and to investigate the behavior of 
inclusion probabilities under them. 

3. It will be of interest to explore alternative RDS estimators that are 
not based on the Horvitz-Thompson estimator. Such estimators may 
not have the general applicability of the Horvitz-Thompson estimator 
but may work in certain special instances. For example, suppose that 
the population quantity whose mean we are interested in estimating is 
distributed as a Bernoulli random variable with probability of success 
equal to p and suppose that p is independent of all features of the 
network. In that case, it should be clear that the sample mean is as 
good as any other estimator. In other words, if the quantity of interest 
does not depend on the network features, then the sample mean is an 
adequate estimator. Drawing from this intuition, it is reasonable to 
believe that for population quantities that do not depend significantly 
on the degrees, one might not need to use the Horvitz-Thompson esti- 
mator as simpler estimates based perhaps on the sample mean might 
be adequate. Making such an idea rigorous will require more work. 
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